Abstract


Statistical machine translation for dialectal Arabic is characterized by a lack
of data since data acquisition involves the transcription and translation of
spoken language. In this study we develop techniques for extracting parallel
data for one particular dialect of Arabic (Iraqi Arabic) from out-of-domain
corpora in different dialects of Arabic or in Modern Standard Arabic. We
compare two different data selection strategies (cross-entropy based and
submodular selection) and demonstrate that a very small but highly targeted
amount of found data can improve the performance of a baseline machine
translation system. We furthermore report on preliminary experiments on using
automatically translated speech data as additional training data.

1 Introduction 


In the Arabic-speaking world, dialectal Arabic (DA) is used side-by-side with
the standard form of the language, Modern Standard Arabic (MSA). Whereas the
latter is used for written and formal oral communication (lectures, speeches),
DA is used for everyday, casual communication. DA is almost never written;
exceptions are transcriptions of spoken language, e.g., in novels, movie
scripts, or in online blogs or forums. DA and MSA exhibit strong differences at
the lexical, phonological, morphological, and syntactic levels; furthermore,
the dialects themselves form a similarity continuum that ranges from closely
related to mutually unintelligible. An overview of the main characteristics of
DA can be found in (Habash, 2010).


*Work was done while the author was at SRI International. 


Most natural language processing (NLP) tools that have been developed for
Arabic have been targeted towards MSA, for which large amounts of written data
exist. NLP for DA suffers from a sparsity of tools as well as data. Work on DA
annotation tools includes the development of morphological analyzers for Arabic
dialects (Habash et al., 2003; Habash et al., 2012; Habash et al., 2013),
treebanks (Maamouri et al., 2006) and parsers (Chiang et al., 2006),
unsupervised (Duh and Kirchhoff, 2005) or supervised (Al-Sabbagh and Girju,
2012) training of POS taggers for DA, and lexicon acquisition (Duh and
Kirchhoff, 2006). However, most of these have been targeted to the Egyptian or
Levantine dialects and do not easily generalize to other dialects. There are a
small number of speech and parallel text corpora for Egyptian, Levantine, and
Iraqi DA, primarily available from the Linguistic Data Consortium (LDC) and the
European Language Resources Association (ELRA). In general, however, spoken
language needs to be recorded and transcribed to produce text data, which
constitutes a bottleneck for the rapid acquisition of new data.


The lack of training data for DA in statistical machine translation (SMT) has
only been addressed in a few previous studies; the standard approach has been
to simply collect more training data by transcribing and translating DA speech.
(Zbib et al., 2012) compare utilizing large amounts of MSA data for training
and creating a small corpus of DA training data. They conclude that simply
adding large amounts of mismatched (MSA) training data does not help, whereas
even a small amount of dialectal data is very useful. Salloum and Habash
(Salloum and Habash, 2011; Salloum and Habash, 2013) propose to transform DA to
MSA by means of a combination of statistical processing and hand-coded
transformation rules, and to then apply MT systems for MSA-to-English. Their
work was on Egyptian Arabic, and porting this approach to a different dialect
involves a fair amount of manual effort and dialect expertise. In (Aminian et
al., 2014) the specific problem of out-of-vocabulary words in MT for DA is
addressed by replacing DA words with their MSA equivalents.

In this paper we attempt to enrich available training data for Iraqi Arabic
(IA) by automatically identifying IA-English parallel data in out-of-domain
corpora of MSA and other dialects of Arabic. This procedure is based on the
assumption that at least some dialects will exhibit similarities with IA.
Corpora formally described as MSA may also contain dialectal data at the
subsentential level due to code-switching (mixed use of MSA and DA), which is
common among Arabic speakers. In principle, automatic dialect identification
methods (Alorifi, 2008; Sadat et al., 2014; Zaidan and Callison-Burch, 2014)
might be used for this purpose; however, these methods are themselves
error-prone and have not been developed for all dialects of Arabic. Our
approach is to directly select data that is matched to features (n-grams)
extracted from a sample corpus of the dialect of interest. In addition to being
dialect-matched, the selected data is also likely to be matched with respect to
topic and style. Two different data selection methods are investigated: the
widely-used cross-entropy method of (Moore and Lewis, 2010), and a more recent
submodular data selection method (Wei et al., 2013). We demonstrate that the
performance of SMT systems for IA can be improved by selecting a very small
amount of highly targeted out-of-domain data. In addition, we conduct a
preliminary investigation of the possibility of using automatically translated
speech data as SMT training data.

The paper is structured as follows: we first report on previous work on data
selection for SMT (Section 2). We then describe the submodular technique used
in this paper in detail (Section 3). The data is described in Section 4;
experiments and results are presented in Section 5. We provide conclusions in
Section 6.


2 Data Selection: Previous Work 


A currently widely-used data selection method in SMT (which we also use as a
baseline in Section 5) uses the cross-entropy between two language models
(Moore and Lewis, 2010), one trained on the test set of interest, and another
trained on generic or out-of-domain training data. We call this the
cross-entropy method. This method trains a test-set specific (or in-domain)
language model, $LM_{in}$, and a generic (out-of- or mixed-domain) language
model, $LM_{out}$. Each sentence $x \in V$ in the training data is scored by
both language models and is assigned the length-normalized log ratio of the
language model probabilities as a score:

$\text{score}(x) = \frac{1}{l(x)} \log \left[ \Pr(x|LM_{in}) / \Pr(x|LM_{out}) \right]$    (1)

where $l(x)$ is the length of sentence $x$. Sentences are then ranked in
descending order based on their scores and the top $N$ sentences are chosen.
Various extensions to this method have been proposed. In (Axelrod et al., 2011)
the monolingual selection method is extended to bilingual corpora. In (Duh et
al., 2013), neural language models are used instead of backoff language models.
Finally, (Mediani et al., 2014) propose a different method for drawing the
out-of-domain sample and the use of word-association models to improve the data
for training the out-of-domain language model.

The cross-entropy approach ranks each sentence individually, without reference
to other sentences. Thus, no sentence interactions can be modelled, such as
redundancy at the sentential or subsentential level. Moreover, the method does
not have a theoretical performance guarantee.
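The scoring rule of the cross-entropy method can be illustrated with a minimal Python sketch. It is not the setup of the cited work: add-one-smoothed unigram models stand in for the backoff n-gram language models, and the toy sentences as well as the names `UnigramLM` and `cross_entropy_rank` are assumptions made purely for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one-smoothed unigram model (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def logprob(self, sentence):
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in sentence.split()
        )


def cross_entropy_rank(pool, lm_in, lm_out, top_n):
    """Rank sentences by the length-normalized log ratio and keep top_n."""
    def score(x):
        return (lm_in.logprob(x) - lm_out.logprob(x)) / len(x.split())
    return sorted(pool, key=score, reverse=True)[:top_n]


# Toy data (assumed): a tiny in-domain sample and an out-of-domain pool.
in_domain = ["where is the checkpoint", "open the door please"]
pool = [
    "the minister opened the summit",
    "please open the gate",
    "stock markets fell sharply",
]
vocab = {w for s in in_domain + pool for w in s.split()}
lm_in = UnigramLM(in_domain, vocab)
lm_out = UnigramLM(pool, vocab)
selected = cross_entropy_rank(pool, lm_in, lm_out, top_n=1)
```

Note that each sentence is scored in isolation: two near-identical high-scoring sentences would both be selected, which is the redundancy problem described above.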

3 Submodular Data Selection 


Submodular functions (Edmonds, 1970; Fujishige, 2005) were first developed in
mathematics, operations research and economics; more recently, they have been
used for a variety of optimization problems in machine learning as well. For
example, they have been applied to the problems of clustering (Narasimhan and
Bilmes, 2007), observation selection (Krause et al., 2008), sensor placement
(Krause and Guestrin, 2011), or image segmentation (Jegelka and Bilmes, 2011).
Within natural language processing (NLP), submodular functions have been used
for extractive text summarization (Lin and Bilmes, 2012).


To explain submodular functions, we introduce the following notation: assume a
finite set of data elements $V$, the ground set. A valuation function
$f : 2^V \to \mathbb{R}_+$ is then defined that returns a non-negative real
value for any subset $A \subseteq V$. The function $f$ is called submodular if
it satisfies the property of diminishing returns: for all
$X \subseteq Y \subseteq V$ and $v \notin Y$, the following is true:

$f(X \cup \{v\}) - f(X) \geq f(Y \cup \{v\}) - f(Y)$.    (2)

This means that the incremental value (or gain) of element $v$ does not
increase when the context in which $v$ is considered grows from $X$ to $Y$. The
“gain” is defined as $f(v|X) = f(X \cup \{v\}) - f(X)$. Thus, $f$ is submodular
if $f(v|X) \geq f(v|Y)$. Submodularity is a natural model for data selection in
SMT and other NLP tasks. The ground set $V$ is the set of training data
elements, and elements are selected from this set according to a submodular
valuation function for any given subset of $V$. The value of this function
diminishes for items that are (partially) redundant with other items in the
already-selected subset, which is precisely the submodularity property. The
specific function we utilize for the purpose of MT data selection is as
follows:

$f(X) = \sum_{u \in U} w_u \, \phi_u\left( \sum_{x \in X} m_u(x) \right)$    (3)

Here, $U$ is a set of features (such as words, n-grams, etc.), $X$ is a subset
of $V$, $w_u$ is a non-negative weight, $\phi_u$ is a non-negative,
non-decreasing concave function, and $m_u(x)$ is a score indicating how
relevant $u$ is in sample $x$. Thanks to the concave function, the contribution
of each feature $u$ in the context of an existing subset $X$ diminishes as $X$
grows.
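To make the diminishing-returns behaviour of the feature-based function in Equation 3 concrete, the following Python sketch evaluates it on toy data. Several choices here are assumptions rather than details from the text: the concave function is taken to be the square root, all weights $w_u$ default to 1, raw n-gram counts serve as the relevance scores $m_u(x)$, and the helper names (`ngrams`, `feature_value`, `gain`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All n-grams of length 1..max_n, as tuples."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def feature_value(subset, features, max_n=2, weights=None):
    """f(X) = sum_u w_u * sqrt(sum_{x in X} m_u(x)), with m_u(x) = raw count."""
    mass = Counter()
    for sentence in subset:
        for g in ngrams(sentence.split(), max_n):
            if g in features:
                mass[g] += 1
    weights = weights or {}  # assumed: unit weight for every feature
    return sum(weights.get(u, 1.0) * math.sqrt(m) for u, m in mass.items())


def gain(v, subset, features):
    """Marginal gain f(v | X) = f(X + {v}) - f(X)."""
    return feature_value(subset + [v], features) - feature_value(subset, features)


# Feature set drawn from a (toy) in-domain sample.
features = set(ngrams("open the door".split(), 2))
X = ["open the gate"]
Y = ["open the gate", "close the door"]  # X is a subset of Y
v = "open the door now"                  # candidate not contained in Y
gain_small = gain(v, X, features)
gain_large = gain(v, Y, features)
```

Because `Y` already covers more of the in-domain n-grams than `X`, the gain of adding `v` to `Y` is smaller than the gain of adding it to `X`, as required by Equation 2.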

In our work the feature set $U$ consists of all n-grams up to a pre-specified
length drawn from a representative in-domain data set. The feature relevance
scores $m_u(x)$ are the tf-idf weighted counts of the features (n-grams). The
tf-idf (term frequency, inverse document frequency) values are computed by
treating each sentence as a “document”. That is, the weighting term is

$\text{tf-idf}(u) = c(u, x) \cdot \log \frac{|V|}{c(u, V)}$    (4)

where $c(u, x)$ is the count of $u$ in $x$ (term frequency), and $c(u, V)$ is
the number of sentences out of $V$ that $u$ occurs in.

The above function can be optimized efficiently even for large data sets.
Formally, we have the following optimization problem:

$X^* \in \operatorname{argmax}_{X \subseteq V,\; m(X) \leq b} f(X)$,    (5)

where $b$ is a known budget; in the present context, the budget can be, e.g.,
the number of words or parallel sentences to select. Solving this problem
exactly is NP-complete (Feige, 1998), and expressing it as an integer linear
programming (ILP) procedure renders it impractical for large data sizes. When
$f$ is submodular and the cost is just size ($m(X) = |X|$), then the simple
greedy algorithm (detailed in Algorithm 1) will have a worst-case guarantee of
$f(X^*) \geq (1 - 1/e) f(X_{opt}) \approx 0.63 f(X_{opt})$, where $X_{opt}$ is
the optimal and $X^*$ is the greedy solution (Nemhauser et al., 1978). This
constant factor guarantee stays the same as $n = |V|$ grows; thus, it scales
well to large data sets. The application of this procedure to the selection of
training data for large-scale SMT tasks was described in (Kirchhoff and Bilmes,
2014). Here, we apply it in the same way to the selection of out-of-domain data
for a small-scale task.

Algorithm 1: The Greedy Algorithm

Input: Submodular function $f : 2^V \to \mathbb{R}_+$, cost vector $m$,
       budget $b$, finite set $V$.
Output: $X_k$, where $k$ is the number of iterations.
Set $X_0 \leftarrow \emptyset$; $i \leftarrow 0$;
while $m(X_i) < b$ do
    Choose $v_i$ as follows:
        $v_i \in \operatorname{argmax}_{v \in V \setminus X_i} f(v|X_i) / m(v)$;
    $X_{i+1} \leftarrow X_i \cup \{v_i\}$; $i \leftarrow i + 1$;


4 Data


The in-domain data available for the present study is the Transtac corpus of
Iraqi Arabic; the sizes of the training, tuning and development test sets are
shown in Table 1.

Train    Tune    Dev    Test1    Test2    Test3
7.6M     64k     29k    8k       10k      9k

Table 1: Size of Iraqi Arabic Transtac corpus partitions (in words).

The out-of-domain data sources used for the selection experiments are listed in
Table 2. We utilize 22 LDC corpora that include MSA and other dialects of
Arabic, notably Egyptian and Levantine. For example, training corpora developed
for the GALE, TIDES, and BOLT projects were included, as were the Levantine
Arabic Treebank, an Egyptian Arabic word alignment corpus, and a corpus of
dialectal Arabic web data (75% Levantine, 25% Egyptian) that was translated
through crowdsourcing (thus, translations are noisy). Note that even though a
corpus may be officially listed as MSA, it may contain segments of DA,
especially when broadcast conversations (e.g., talkshows) are included.

LDC ID        Description                    Genre             Dialect               Size
LDC2005E83    GALE-Y1Q1                      BN, BC, WB        MSA                   170k
LDC2006E34    GALE-Y1Q2                      BC, WB            MSA                   126k
LDC2006E39    Tides MT05 Eval                NW                MSA                   135k
LDC2006E44    Tides MT04 Eval                NW                MSA                   170k
LDC2006E85    GALE-Y1Q3                      WB                MSA                   18k
LDC2006E92    GALE-Y1Q4                      BN, WB            MSA                   293k
LDC2012T06    GALE-P2-BC                     BC                MSA                   174k
LDC2007E101   GALE-P2R1                      BC, BN            MSA                   337k
LDC2007E101   GALE-P3R1                      BC, NW, BN        MSA                   530k
LDC2007E46    GALE-P2R2                      NW, WB            MSA                   87k
LDC2007E87    GALE-P2R3                      BC, BN, NW, WB    MSA                   188k
LDC2008E40    GALE-P3R2                      BC, BN            MSA                   268k
LDC2009E15    GALE-P4R1v2                    WB, BC, BN, NW    MSA                   305k
LDC2009E16    GALE-P4R2                      BC, BN, NW, WB    MSA                   273k
LDC2009E95    GALE-P4R3v1.2                  BC, BN, NW, WB    MSA                   147k
LDC2010E38    GALE-P3 Treebank               BC, NW            MSA                   349k
LDC2010E79    GALE-Levantine                 BC                Levantine             34k
LDC2010T17    NIST-OpenMT-2006               NW, BC, BN, WB    MSA                   141k
LDC2010T23    NIST-OpenMT-2009               NW, WB            MSA                   129k
LDC2012E19    BOLT-P1-R2 MT Training Data    DF                Egyptian              126k
LDC2012E51    BOLT-P1 ARZ word alignments    DF                Egyptian              55k
LDC2012T09    Web translations               Various           Egyptian, Levantine   1.613M
Total                                                                                5.690M

Table 2: List of out-of-domain corpora used for data selection. BN = broadcast
news, BC = broadcast conversations, WB = web blogs, NW = newswire data, DF =
discussion forums. Sizes are given in number of source-language words after
tokenization.


5 Experiments and Results


We use two different MT systems for translation from IA to English, an in-house
system based on Moses and the SRI MT system developed for the DARPA BOLT (Broad
Operational Language Translation) spoken dialog translation project (see (Ayan
et al., 2013; Kirchhoff et al., 2015) for more details). The former is a flat
phrase-based statistical MT system with a hierarchical lexicalized reordering
model and a 6-gram language model trained on the target side of the Transtac
training data. For preprocessing we use a statistical morphological segmenter
developed in the BOLT project. The second system is similar in nature but has a
hierarchical phrase-based translation model and utilizes sparse features (see
(Zhao et al., 2014) for more information).

5.1 Initial evaluation of selection techniques



In an initial set of experiments we attempted to gauge the performance of the
cross-entropy vs. the submodular selection technique by subselecting the
Transtac training data. We chose 10-40% of the Transtac training set; the
feature set $U$ was the set of all n-grams up to length 7 of the tune and dev
sets. We investigated both translation directions, IA → English and English →
IA. Table 3 shows the BLEU scores.

Compared to using 100% of the training data, the same or even better
performance can be obtained by using a subset of the data when the submodular
subselection technique is used, even at small percentages of the training data.
The cross-entropy method falls short of this performance, presumably due to the
failure of this method to control for redundancy in the selected set.














         IA→EN            EN→IA
Size     Xent    SM       Xent    SM
10%      30.2    32.3     16.1    17.9
20%      31.5    32.5     17.0    17.7
30%      31.8    32.5     17.4    17.5
40%      31.2    32.6     17.3    17.5
100%     32.5             16.2

Table 3: BLEU scores on dev set for training data subselection using
cross-entropy (Xent) vs. the submodular (SM) method. IA = Iraqi Arabic, EN =
English.



IA-EN
            BLEU (%)    PER (%)
Baseline    33.5        40.9
Xent        33.6        40.9
Submod      33.7        40.7

EN-IA
            BLEU (%)    PER (%)
Baseline    17.0        57.1
Xent        17.1        57.1
Submod      17.2        56.8

Table 4: BLEU and PER on dev set for system with additional out-of-domain data,
in-house system.

5.2 Selection of out-of-domain training data 

In order to integrate additional out-of-domain training data, we set a budget
constraint of 100k words on the source side. The LDC corpora were pre-processed
in the same manner as the Transtac data, i.e., they were preprocessed and
morphologically segmented. The greedy algorithm was used in combination with
Equation 3 to select parallel sentences from the corpora listed in Table 2 such
that the resulting corpus contains at most 100k words on the source side. The
selected data was then added to both the MT and LM training data. Table 4 shows
the BLEU scores and position-independent word error rate (PER) for the in-house
MT system that was used for development purposes (note that baseline results
are different from those in Table 3 because the baseline MT system changed in
between experiments and was trained on different data set definitions and
tokenization schemes). We again compared the cross-entropy against the
submodular selection method. Improvements in the system are small; however, the
submodular technique again shows slightly better results.
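The selection step described above (greedy maximization of the feature-based function under a source-side word budget) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the concave function is assumed to be the square root, all feature weights are 1, the greedy step is assumed to pick the best gain-per-word candidate that still fits the budget, and `greedy_select`, `tfidf_vectors` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All n-grams of length 1..max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(pool, features, max_n):
    """m_u(x) = c(u, x) * log(|V| / c(u, V)); each sentence is a 'document'."""
    counts = [Counter(g for g in ngrams(s.split(), max_n) if g in features)
              for s in pool]
    doc_freq = Counter(u for c in counts for u in c)
    return [{u: tf * math.log(len(pool) / doc_freq[u]) for u, tf in c.items()}
            for c in counts]


def greedy_select(pool, in_domain, budget_words, max_n=2):
    """Cost-scaled greedy maximization of f(X) = sum_u sqrt(sum_x m_u(x))."""
    features = {g for s in in_domain for g in ngrams(s.split(), max_n)}
    vectors = tfidf_vectors(pool, features, max_n)
    cost = [len(s.split()) for s in pool]   # cost = source-side word count
    mass = Counter()                        # accumulated m_u over selected set
    selected, used = [], 0
    remaining = list(range(len(pool)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in remaining:
            if cost[i] == 0 or used + cost[i] > budget_words:
                continue                    # empty or over-budget candidate
            g = sum(math.sqrt(mass[u] + m) - math.sqrt(mass[u])
                    for u, m in vectors[i].items())
            if g / cost[i] > best_ratio:
                best, best_ratio = i, g / cost[i]
        if best is None:                    # nothing fits or adds value
            break
        selected.append(pool[best])
        used += cost[best]
        mass.update(vectors[best])
        remaining.remove(best)
    return selected


in_domain = ["where is the checkpoint", "open the door"]
pool = [
    "open the door now",
    "open the door please",
    "the minister opened the summit",
    "where is the hospital",
]
selected = greedy_select(pool, in_domain, budget_words=8)
```

In this toy run the second "open the door" sentence is not chosen: once the first one is selected, its n-grams are already covered, so its marginal gain shrinks, and it no longer fits the remaining budget.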

We subsequently used the data selected with the submodular method in the second
MT system, viz. the evaluation system developed for a bilingual dialog system,
and tested the system on additional in-domain data sets. BLEU scores (shown in
Table 5) show slight improvements of up to 0.5 absolute. Note that the selected
data set was very small, containing only 7k sentences. Larger sets (up to 20k)
were tried but were not found to be useful.

System       dev     test1    test2    test3
base         17.5    35.2     32.2     33.1
+ 7k data    17.5    35.5     32.7     33.5

Table 5: BLEU scores of EN-IA system, obtained with an additional 7k sentences
of submodular-selected data, evaluation system.

We analyzed the selected data as to its origin and 
found that the top three data sources were broadcast 
conversations from various GALE corpora (47%), 
the dialectal web corpus (35.7%), and the BOLT 
MT training data (9.9%). 

5.3 Using translated speech data 

In addition to the various parallel text corpora listed in Table 2 we also had
access to an Iraqi Arabic Conversational Telephone Speech (CTS) corpus
(LDC2006T16). This corpus includes speech transcriptions but no translations.
Although the data matches the dialect of interest, it is not necessarily
matched in topic or style. To obtain parallel data we translated the
transcriptions of this corpus with our baseline IA → EN translation system.
Those segments that were translated contiguously (i.e., without intervening
out-of-vocabulary words) were extracted and added to the data from the corpora
in Table 2. Data selection was then re-run. We found that in this experiment
80% of the selected data came from the CTS corpus; however, the translation
performance did not improve (see Table 6). The likely reason is that
translations were too noisy to be used as parallel data and introduced more
confusability and irrelevant variation rather than contributing useful
translations. The use of automatically translated speech data might be improved
by selecting only the most confident translations according to the translation
model scores.
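The extraction of contiguously translated segments can be pictured with a small Python sketch. The text does not specify how segments were identified, so everything here is an assumption for illustration: a segment is taken to be a maximal run of source tokens known to the translation system, the `known_vocab` set stands in for the system's source vocabulary, and the `min_len` threshold and the function name `contiguous_segments` are hypothetical.

```python
def contiguous_segments(tokens, known_vocab, min_len=2):
    """Split a token list into maximal runs that contain no OOV token."""
    segments, current = [], []
    for tok in tokens:
        if tok in known_vocab:
            current.append(tok)
        else:
            # An out-of-vocabulary token ends the current run.
            if len(current) >= min_len:
                segments.append(current)
            current = []
    if len(current) >= min_len:
        segments.append(current)
    return segments


# Placeholder tokens: lower-case letters are "known", UNK* are OOV.
known_vocab = {"a", "b", "c", "d", "e"}
utterance = "a b UNK1 c d e UNK2 a".split()
segments = contiguous_segments(utterance, known_vocab)
```

Each extracted source segment would then be paired with its machine translation before being added to the selection pool.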

IA-EN
            BLEU (%)    PER (%)
Baseline    33.5        40.9
Submod      33.7        40.7
+ CTS       33.8        41.0

EN-IA
            BLEU (%)    PER (%)
Baseline    17.0        57.1
Submod      17.2        56.8
+ CTS       17.0        57.5

Table 6: BLEU and PER on dev set, system with additional out-of-domain data,
including CTS, in-house system.

6 Conclusion

We have described data selection procedures for identifying Iraqi Arabic data
resources in unrelated dialectal and/or MSA corpora. We have demonstrated that
judiciously selected data can improve MT performance even when the overall
amount is very small. Furthermore, we have compared two different data
selection techniques, the widely-used cross-entropy selection method, and a
more recently developed method that relies on submodular function optimization.
The latter performed slightly better than the former. Finally, we have
conducted initial experiments on utilizing automatically translated
conversational speech as additional training data. Whereas the data was
strongly matched to the in-domain data on the source side, the translations
were too noisy to yield any further improvement in machine translation
performance.